Abstract: While most current high-throughput DNA sequencing 
technologies generate short reads with low error rates, emerging 
sequencing technologies generate long reads with high error rates. 
A basic question of interest is the tradeoff between read length 
and error rate in terms of the information needed for the perfect 
assembly of the genome. Using an adversarial erasure error model, 
we make progress on this problem by establishing a critical read 
length, as a function of the genome and the error rate, above which 
perfect assembly is guaranteed. For several real genomes, including 
those from the GAGE dataset, we verify that this critical read length 
is not significantly greater than the read length required for perfect 
assembly from reads without errors. 

I. Introduction 

Current DNA sequencing technologies are based on a two-step 
process. First, tens or hundreds of millions of fragments from 
random locations on the DNA sequence are read via shotgun 
sequencing. Second, these fragments, called reads, are merged 
to each other based on regions of overlap, using an assembly 
algorithm. 

Roughly speaking, different shotgun sequencing platforms can 
be distinguished from the point of view of three main metrics: 
the read length, the read error rate, and the read throughput. 
In the last decade, the so-called next-generation sequencing 
platforms have attained considerable success at employing heavy 
parallelization in order to achieve high-throughput shotgun sequencing. 
This allowed a significant reduction in the cost and 
time of sequencing, causing an explosion in the number of new 
sequencing projects and the generation of massive amounts of 
sequencing data. 

In order to guarantee low error rates, most of these next-generation 
technologies are restricted to short read lengths, 
shifting some of the burden of sequencing to the assembly step. 
In practice, this results in very fragmented assemblies, with large 
gaps and little linking information between fragments [1]. On the 
other hand, recent technologies that generate longer reads suffer 
from lower throughput and much higher error rates.¹ 

Given this technology trend, the natural questions to ask are: 
what is the impact of read errors on the performance of assemblers? 
Is the negative impact of read errors more than offset by 
the increase in read lengths in long-read technologies? It is well 
known that read errors have a significant impact on assembly algorithms. 
For example, in de Bruijn graph based algorithms, read 
errors create extraneous nodes and edges in the assembly graph, 
which results in added complexity. However, these observations 
pertain to specific algorithms. A more fundamental question can 
be asked from an information-theoretic point of view: given a 
read length, an error rate and a coverage depth (number of reads 
per base), is there enough information in the read data to uniquely 
reconstruct the genome? Do errors significantly increase the read 

¹One example of a short-read-length technology is Illumina, with reads of 
length ~200 base pairs and error rates of about 1%. In contrast, PacBio reads 
can be several thousand base pairs long, with error rates of about 10-15%. 


length and/or coverage depth requirements? An answer to these 
basic feasibility questions can provide an algorithm-independent 
framework for evaluating different sequencing technologies. It 
would also settle some speculations in the assembly community 
on whether read errors have a significant impact in long-read 
technologies (see for example [2]). 

Such a framework was initiated in [3] for error-free reads: 
a feasibility curve relating the read length and coverage depth 
needed to perfectly assemble a genome was characterized in 
terms of the repeat complexity of the genome (see examples in 
Fig. 1). Evaluating this curve on several genomes revealed an 
interesting threshold phenomenon: if the read length is below a 
certain critical value $\ell_{\mathrm{crit}}$, reconstruction is impossible; a read 
length slightly above $\ell_{\mathrm{crit}}$ and a coverage depth close to the 
Lander-Waterman depth $c_{\mathrm{LW}}$ (i.e., just enough reads to cover 
the whole sequence) are sufficient. The critical read length $\ell_{\mathrm{crit}}$ 
is given by the length of the longest interleaved repeat in the 
genome, and coincides with the minimum read length L needed 
to uniquely reconstruct the genome given its L-spectrum, i.e., the 
set of reads with one length-L read starting at each position of 
the sequence, illustrated in Fig. 2. This minimum read length 
also appeared in earlier works by Ukkonen and Pevzner [4, 5] 
for reconstruction via sequencing by hybridization. 

Given this framework, the impact of read errors can be studied 
by asking how much the critical read length $\ell_{\mathrm{crit}}$ increases when 
there are errors. In this paper, we investigate this tradeoff for a 
specific error model: 1) the errors are erasures; 2) the erasures 
occur at a rate no larger than D/L for each read and for each 
base in the sequence, but are otherwise arbitrary. Our main result 
is the characterization of a critical read length $\tilde{\ell}_{\mathrm{crit}}$ above which 
perfect assembly is always possible. While in the noiseless case 
$\ell_{\mathrm{crit}}$ is a function of the sequence repeat structure, $\tilde{\ell}_{\mathrm{crit}}$ depends 
more generally on the error rate and on the approximate repeats 
in the sequence. More concretely, for a sequence s, 

$$\tilde{\ell}_{\mathrm{crit}}(s, D) = \min_{k \geq \ell_{\mathrm{crit}}(s)} \; k + D \cdot M_s(D, k+1),$$

where $M_s(D, \ell)$ is the maximum number of D-approximate 
length-$\ell$ repeats in s. Moreover, reminiscent of classical coding 
theory results, we show that the same read length $\tilde{\ell}_{\mathrm{crit}}$ is sufficient 
for assembly if instead of erasures we consider substitution errors 
at half of the rate. In order to characterize $\tilde{\ell}_{\mathrm{crit}}$, we derive a new 
result about the error correction capability of the L-spectrum. 
More precisely, we show that given a noisy version of the 
L-spectrum of a sequence, it is possible to obtain the noiseless 
(k+1)-spectrum of the same sequence, for any k such that 
$L > k + D \cdot M_s(D, k+1)$. When $L > \tilde{\ell}_{\mathrm{crit}}$, we can obtain the 
noiseless (k+1)-spectrum for some $k \geq \ell_{\mathrm{crit}}$, and the noiseless 
result from [3] implies that perfect assembly is possible. 

Fig. 1. The thick black curve is a feasibility lower bound for any algorithm, and 
the green line represents the performance of the Multibridging algorithm [3]. 
Panel (b): R. sphaeroides. 

Fig. 2. The sequence s and its L-spectrum, $\mathcal{R}_{L,0}(s)$. 

By evaluating $\tilde{\ell}_{\mathrm{crit}}$ on several real genomes, including those 
in the GAGE dataset [6], we verify that $\tilde{\ell}_{\mathrm{crit}}$ is not significantly 
larger than $\ell_{\mathrm{crit}}$. In fact, in most cases, $\tilde{\ell}_{\mathrm{crit}} \approx \ell_{\mathrm{crit}} + 3D$. Hence, if 
the read length L is chosen above the noiseless requirement $\ell_{\mathrm{crit}}$, 
perfect assembly is robust to errors up to a threshold (roughly 
$\frac{1}{3}(L - \ell_{\mathrm{crit}})$ erasures per read). 
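
The rule of thumb above can be turned into a one-line calculation. The following is a minimal sketch, assuming the empirical approximation that the noisy critical length is the noiseless one plus 3D; the function name `max_tolerated_erasures` and the `slope` parameter are hypothetical and not part of the paper.

```python
def max_tolerated_erasures(L: int, l_crit: int, slope: int = 3) -> int:
    """Largest D such that l_crit + slope * D < L (0 if L does not exceed l_crit).

    Assumption of this sketch: the noisy critical read length is approximated
    by l_crit + slope * D, with slope = 3 as observed for most genomes.
    """
    if L <= l_crit:
        return 0
    # We need l_crit + slope * D <= L - 1, i.e. D <= (L - l_crit - 1) / slope.
    return (L - l_crit - 1) // slope


# Noiseless critical length reported for S. aureus in Table I.
print(max_tolerated_erasures(L=2400, l_crit=1799))  # -> 200
```

With reads of length 2400, the approximation tolerates up to 200 erasures per read, which agrees with the S. aureus row of Table I.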

The impact of read errors on the information theoretic limits 
of genome assembly has also been studied in the setting of an 
i.i.d. genome model and asymptotically long genome length [7], 
building on an earlier work on error-free reads in the same setting 
[8]. The results are surprising: as long as the error rate is below a 
threshold (which can be as high as 19% for substitution errors), 
noisy reads are as good as noiseless reads; i.e., the requirements 
for assembly in terms of read length and coverage depth are the 
same in both cases. While this result seems stronger than the 
result in the present paper, it is proved under the idealistic and 
unrealistic settings of i.i.d. genome statistics and i.i.d. errors. The 
present result, on the other hand, is more robust as it applies to 
arbitrary genome repeat statistics and error statistics. 

II. Problem Setting 

In the DNA assembly problem, the goal is to reconstruct a 
sequence s = (s[1], ..., s[G]) of length G with symbols from the 
alphabet $\Sigma$ = {a, c, g, t}. In order to simplify the exposition, 
we assume a circular DNA model; thus, s is a periodic 
sequence with (minimum) period G. Our results hold in the non-circular 
case as well under minor modifications. We will let $s_i^\ell$ 
be the substring of length $\ell$ starting at s[i]; i.e., 
$s_i^\ell = (s[i], ..., s[i + \ell - 1])$. 

The sequencer provides a multiset of N reads $\mathcal{R} = \{r_1, ..., r_N\}$ 
from s, each of length L. In the noiseless case, 
each read is a length-L substring of s with an unknown starting 
location. Our focus, however, will be on noisy read models, where 
each read may be corrupted by noise. The goal is to design 
an assembler, which takes the set of reads $\mathcal{R}$ and attempts to 
reconstruct the sequence s. 

A. The L-Spectrum Read Model 

We will consider a “dense-read” model, in which all the reads 
in the L-spectrum of s are provided. More precisely, $\mathcal{R}$ will have 
exactly G reads, one from each possible starting position; i.e., 
$\mathcal{R} = \{r_1, ..., r_G\}$, where $r_i = s_i^L$ for i = 1, ..., G. We will refer 
to the error-free L-spectrum of s by $\mathcal{R}_{L,0}(s)$. Notice that the 
starting position i for each read is unknown to the assembler. 
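
As a concrete illustration of the dense-read model, the following minimal sketch enumerates the error-free L-spectrum of a short circular sequence. The function name `l_spectrum` and the 0-based indexing are assumptions made for this example only.

```python
from collections import Counter


def l_spectrum(s: str, L: int) -> list:
    """Return the G reads of the error-free L-spectrum of the circular sequence s.

    Read i (0-based here) is the length-L substring starting at s[i]; indices
    wrap around modulo G because the sequence is circular.
    """
    G = len(s)
    return [''.join(s[(i + j) % G] for j in range(L)) for i in range(G)]


reads = l_spectrum("acgta", 3)
print(reads)           # -> ['acg', 'cgt', 'gta', 'taa', 'aac']
print(Counter(reads))  # the assembler only sees this multiset, not the positions
```

The list order encodes the true starting positions, which is exactly the information hidden from the assembler; only the multiset of reads is observed.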

While such a read model was originally proposed in the context 
of sequencing by hybridization [4, 5, 9], our motivation for using 
it comes from next-generation sequencing technologies, where 
the high read throughput can provide large coverage depths at 
low costs, and a dense read regime is not unrealistic. This way, 
we can bypass the question of the necessary coverage depth for 
assembly, and instead focus on the interplay between read length 
and error rate in the context of assembly feasibility. Moreover, as 
shown in [3] for noiseless reads, the dense-read model provides 
valuable insights towards understanding the information-theoretic 
limits of reconstruction in the more general shotgun read model. 

In the L-spectrum read model, since we have exactly G reads, 
an assembly of the reads $\mathcal{R}_{L,0}(s) = \{r_1, ..., r_G\}$ can be thought 
of as a permutation $\sigma$ of the entries of (1, ..., G). We assume without 
loss of generality that the identity permutation $\sigma_0$ = (1, ..., G) 
yields a correct assembly of s. Notice, however, that the index i 
of each read is unknown to the assembler. Notice also that in 
general, there may be multiple correct assemblies for a sequence 
s if $r_i = r_j$ for some $i \neq j$. 

B. Adversarial Erasure Model 

As in the classical coding theory literature, we will study the 
problem of DNA assembly with noisy reads from the perspective 
of an adversarial noise model. Given that actual sequencing 
noise profiles are complex (non-i.i.d., asymmetric across bases) 
and technology-dependent, this approach avoids the need for a 
probabilistic noise model by instead focusing on a worst-case 
scenario. Moreover, under this model we can hope to obtain 
deterministic and non-asymptotic conditions for perfect assembly, 
which can be more easily analyzed in terms of real genome data. 

Motivated by the fact that sequencing technologies usually 
provide a quality score for each base that is read (which could be 
thresholded into “good” and “bad” bases), and in order to simplify 
the problem, we will consider an erasure model. The reads in $\mathcal{R}$ 
will be length-L sequences from the alphabet $\Sigma'$ = {a, c, g, t, $\varepsilon$}, 
where $\varepsilon$ corresponds to an erasure. Thus, a read starting at 
position i from s can be written as $r_i = (r_i[0], ..., r_i[L-1])$, 
where either $r_i[j] = s[i+j]$ or $r_i[j] = \varepsilon$, for $1 \leq i \leq G$ and 
$0 \leq j \leq L-1$. 

For a fixed parameter D, the adversarial erasure model will 
be constrained by a maximum error rate of D/L within each 
read, and for each base. Since in our read model each base s[i] 
is read L times ($r_{i-(L-1)}[L-1], r_{i-(L-2)}[L-2], ..., r_i[0]$), these 
constraints can be written as follows: 

a) There are at most D erasures per read. 

b) Each base s[i] is erased at most D times across all reads. 

We will use $\mathcal{R}_{L,D}(s)$ to refer to the L-spectrum of s, $\mathcal{R}_{L,0}(s)$, 
after being corrupted by erasures satisfying (a) and (b). 
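
Constraints (a) and (b) are easy to check when the true read positions are known, which is the adversary's point of view rather than the assembler's. The sketch below assumes reads are listed in true position order and uses the character `e` as a stand-in for the erasure symbol; both choices, and the function name, are assumptions of this example.

```python
ERASURE = "e"  # stand-in for the erasure symbol (assumption of this sketch)


def satisfies_erasure_constraints(noisy_reads: list, D: int) -> bool:
    """Check constraints (a) and (b) for reads listed in true position order.

    noisy_reads[i] is the (possibly erased) read starting at position i of a
    circular sequence of length G = len(noisy_reads); all reads have length L.
    """
    G = len(noisy_reads)
    # (a) at most D erasures within each read
    if any(read.count(ERASURE) > D for read in noisy_reads):
        return False
    # (b) each base is erased at most D times across the reads that cover it
    erased_per_base = [0] * G
    for i, read in enumerate(noisy_reads):
        for j, symbol in enumerate(read):
            if symbol == ERASURE:
                erased_per_base[(i + j) % G] += 1
    return all(count <= D for count in erased_per_base)


# Reads of the circular sequence "acgta" with L = 3; base 2 is erased twice.
print(satisfies_erasure_constraints(["ace", "cet", "gta", "taa", "aac"], D=1))  # -> False
print(satisfies_erasure_constraints(["ace", "cet", "gta", "taa", "aac"], D=2))  # -> True
```

In the first call every read individually respects (a), but the same base is erased in two different reads, so (b) fails for D = 1.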

In the context of an adversarial noise model with deterministic 
constraints, it makes sense to restrict our attention to potential 
sequences $\hat{s}$ that are consistent with the reads $\mathcal{R}_{L,D}(s)$. A 
sequence $\hat{s}$ is said to be consistent with $\mathcal{R}_{L,D}(s)$ if it could 
have generated the set of reads $\mathcal{R}_{L,D}(s)$ according to the erasure 
model in (a) and (b). By extension, we will say that an assembly 
$\sigma$ of $\mathcal{R}_{L,D}(s)$ is consistent if there exists a sequence $\hat{s}$, consistent 
with $\mathcal{R}_{L,D}(s)$, that could have generated the reads in $\mathcal{R}_{L,D}(s)$ 
according to the positions determined by $\sigma$. As illustrated in 
Fig. 3, we notice that (b) guarantees that a consistent assembly $\sigma$ 
defines, up to cyclic shifts, a unique consistent sequence in $\Sigma^G$, 
which we will refer to as $\hat{s}(\sigma)$. 

Fig. 3. Part of a consistent assembly for L = 5 and D = 2. Notice that there can 
be at most D erasures per read and per “column” of the assembly $\sigma$. Moreover, 
all non-erased bases in a column must agree. 
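
To make the column-wise picture concrete, here is a minimal sketch that reads off the sequence defined by an assembly. It assumes the assembly is given as a list `placed_reads`, where entry p is the read placed at start position p; that representation, the `e` erasure symbol, and the function name are assumptions of this example. The sketch only checks that columns agree and are not fully erased; it does not verify the erasure budget D.

```python
from typing import Optional

ERASURE = "e"  # stand-in for the erasure symbol (assumption of this sketch)


def consensus_from_assembly(placed_reads: list) -> Optional[str]:
    """Return the circular sequence defined by an assembly, or None.

    placed_reads[p] is the read that the assembly places at start position p.
    None is returned if some column contains conflicting non-erased bases or
    only erasures, in which case the assembly defines no sequence.
    """
    G = len(placed_reads)
    columns = [set() for _ in range(G)]
    for p, read in enumerate(placed_reads):
        for j, symbol in enumerate(read):
            if symbol != ERASURE:
                columns[(p + j) % G].add(symbol)
    # Each column must contain exactly one distinct non-erased symbol.
    if any(len(column) != 1 for column in columns):
        return None
    return ''.join(next(iter(column)) for column in columns)


print(consensus_from_assembly(["aeg", "cgt", "gta", "tea", "aac"]))  # -> acgta
print(consensus_from_assembly(["acg", "cgt", "gta", "taa", "aat"]))  # -> None
```

In the second call the last read disagrees with the read at position 1 in one column, so no sequence is consistent with that placement.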

The fundamental feasibility question corresponds to asking 
which values of L allow unambiguous reconstruction. Formally, 
it corresponds to the following algorithm-independent question. 

Question 1. Consider a fixed circular sequence $s \in \Sigma^G$. What 
values of L guarantee that, for an arbitrary set of erased reads 
$\mathcal{R}_{L,D}(s)$, s is the unique sequence consistent with $\mathcal{R}_{L,D}(s)$? 

III. Assembly in the Noiseless Case 

The assembly problem in Question 1 was first studied in [4] 
in the noiseless setting D = 0. Notice that when L = 1, $\mathcal{R}_{1,0}(s)$ 
is simply the multiset {s[1], ..., s[G]} and any permutation 
$\sigma$ of (1, ..., G) is a consistent assembly. Hence, s cannot be 
reconstructed unambiguously, unless all of its symbols are the 
same. On the other hand, when L = G, there is a unique 
assembly of $\mathcal{R}_{G,0}(s)$ = {s}, and s can always be reconstructed 
unambiguously. Question 1 is thus equivalent to asking for the 
threshold $\ell_{\mathrm{th}}$ for which s can be reconstructed if and only if 
$L > \ell_{\mathrm{th}}$. In [4], this threshold is established as a function of the 
repeat structure of the sequence s, as we explain next. 

A repeat of length $\ell$ in s is a subsequence appearing twice 
at some positions $t_1$ and $t_2$ (so $s_{t_1}^\ell = s_{t_2}^\ell$) that is maximal; 
i.e., $s[t_1 - 1] \neq s[t_2 - 1]$ and $s[t_1 + \ell] \neq s[t_2 + \ell]$. Two pairs of 
repeats $s_{a_1}^\ell, s_{a_2}^\ell$ and $s_{b_1}^k, s_{b_2}^k$ are interleaved if $a_1 < b_1 < a_2 < b_2$. 
Due to the circular DNA model, since a subsequence $s_t^\ell$ can also 
be written as $s_{t+mG}^\ell$ for any integer m, we additionally require 
that $b_2 - a_1 < G$. The length of a pair of interleaved repeats 
$s_{a_1}^\ell, s_{a_2}^\ell$ and $s_{b_1}^k, s_{b_2}^k$ is defined to be $\min(\ell, k)$. We let $\ell_{\mathrm{inter}}(s)$ 
be the length of the longest pair of interleaved repeats in s and 
set $\ell_{\mathrm{crit}}(s) = \ell_{\mathrm{inter}}(s) + 1$. The results from [4, 5] imply the 
following: 

Theorem 1. If $L > \ell_{\mathrm{crit}}(s)$, then s is the unique sequence that is 
consistent with $\mathcal{R}_{L,0}(s)$. Conversely, if $L \leq \ell_{\mathrm{crit}}(s)$, there exists 
a sequence $s' \neq s$ that is also consistent with $\mathcal{R}_{L,0}(s)$. 

In other words, Theorem 1 characterizes the threshold on L 
that fully answers Question 1. We point out that, in the previous 
literature [3, 4], $\ell_{\mathrm{crit}}$ was defined in terms of the length of pairs 
of interleaved repeats (defined in a more restrictive way) and 
the length of triple repeats. However, one can verify that by 
considering the more general definition of interleaved repeats 
above, triple repeats are included as a special case. 

Notice that, while Theorem 1 characterizes the minimum L 
that guarantees perfect reconstruction, $\ell_{\mathrm{crit}}(s)$ is a function of the 
ground truth s, and is not known a priori. However, the following 
corollary of Theorem 1 readily follows: 

Corollary 1. If a sequence $\hat{s}$ is consistent with $\mathcal{R}_{L,0}(s)$ and 
$L > \ell_{\mathrm{crit}}(\hat{s})$, then $\hat{s} = s$. 

Since $\ell_{\mathrm{crit}}(\hat{s})$ can be computed from the assembled sequence 
$\hat{s}$, this result means that $L > \ell_{\mathrm{crit}}(\hat{s})$ provides a certificate that 
$\hat{s} = s$, even without previous knowledge of $\ell_{\mathrm{crit}}(s)$. 

IV. Main Results 

In the previous section, we described how Theorem 1 fully 
characterizes when assembly is possible given the noiseless L- 
spectrum. In this section, we seek a similar characterization in 
the case where reads are noisy. 

Notice that for the erasure setting described in Section II, one 
possible erasure pattern is to have the last D bases from each 
read erased, which effectively results in noiseless reads of length 
L − D. Therefore, the converse part of Theorem 1 implies that, 
if $L \leq \ell_{\mathrm{crit}}(s) + D$, there is a read set $\mathcal{R}_{L,D}(s)$ and a sequence 
$s' \neq s$ that is consistent with $\mathcal{R}_{L,D}(s)$. But how much larger 
than $\ell_{\mathrm{crit}}(s) + D$ does the read length L have to be in order to 
guarantee unambiguous correct reconstruction? In other words, 
how do erasures degrade the fundamental limit characterized by 
Theorem 1? 
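
The erasure pattern used in this argument can be written down explicitly. The following minimal sketch erases the last D bases of every read of a circular sequence; the `e` erasure symbol and the function name `erase_tails` are assumptions of this example.

```python
ERASURE = "e"  # stand-in for the erasure symbol (assumption of this sketch)


def erase_tails(s: str, L: int, D: int) -> list:
    """Noisy L-spectrum of circular s with the last D bases of each read erased.

    Each read has exactly D erasures, and each base is erased exactly D times
    (by the D reads that see it at one of their last D offsets), so the
    pattern satisfies constraints (a) and (b).
    """
    G = len(s)
    reads = []
    for i in range(G):
        kept = ''.join(s[(i + j) % G] for j in range(L - D))
        reads.append(kept + ERASURE * D)
    return reads


noisy = erase_tails("acgta", L=3, D=1)
print(noisy)                                   # -> ['ace', 'cge', 'gte', 'tae', 'aae']
print([read.rstrip(ERASURE) for read in noisy])  # the error-free 2-spectrum
```

Stripping the erased tails leaves exactly the error-free (L − D)-spectrum, which is why this pattern reduces the problem to the noiseless case with shorter reads.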

Our main result is the introduction of a new sequence-dependent 
quantity, $\tilde{\ell}_{\mathrm{crit}}(s, D)$, such that, if $L > \tilde{\ell}_{\mathrm{crit}}(s, D)$, 
s is the unique sequence consistent with $\mathcal{R}_{L,D}(s)$. In general, 
$\ell_{\mathrm{crit}}(s) + D \leq \tilde{\ell}_{\mathrm{crit}}(s, D)$ for D > 0, and one can construct 
sequences $s \in \Sigma^G$ for which the gap between the two 
quantities is significant. However, by computing $\ell_{\mathrm{crit}}(s) + D$ and 
$\tilde{\ell}_{\mathrm{crit}}(s, D)$ for actual genomes, we verify that they are often close, as 
shown in Table I. 

Rather than being defined in terms of exact repeats, as is the 
case of $\ell_{\mathrm{crit}}(s)$, $\tilde{\ell}_{\mathrm{crit}}(s, D)$ depends more generally on approximate 
repeats. For a set of segments S of a given length $\ell$, i.e., $S \subset \Sigma^\ell$, 
we will first define the radius of S to be 

$$\rho(S) = \min_{x \in \Sigma^\ell} \max_{y \in S} d_H(y, x), \tag{1}$$

where $d_H(y, x)$ is the Hamming distance between y and x. 
We will say that the segments in S are d-approximate copies if 
$\rho(S) \leq d$. Intuitively, a sequence s that contains a large set S of 
length-$\ell$ segments with a small radius $\rho(S)$ has more ambiguity in 
terms of assembly. To capture that, we will let $M_s(d, \ell)$ correspond 
to the maximum number of d-approximate length-$\ell$ segments in 
s; i.e., 

$$M_s(d, \ell) = \max\{|S| : S \subset \mathcal{R}_{\ell,0}(s),\ \rho(S) \leq d\}. \tag{2}$$
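
For toy sequences, definitions (1) and (2) can be evaluated by brute force over all candidate centers. The sketch below relies on an observation that follows directly from (1): a set S has radius at most d exactly when all of its segments lie within Hamming distance d of a common center x, so the largest such S is the largest group of spectrum segments inside a single Hamming ball. The function names are assumptions of this example, and the enumeration is exponential in the segment length.

```python
from itertools import product


def hamming(x: str, y: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(a != b for a, b in zip(x, y))


def radius(S: list, alphabet: str = "acgt") -> int:
    """rho(S): minimum over centers x of the maximum distance from x to S."""
    length = len(S[0])
    centers = map(''.join, product(alphabet, repeat=length))
    return min(max(hamming(y, x) for y in S) for x in centers)


def max_approx_copies(s: str, d: int, length: int, alphabet: str = "acgt") -> int:
    """M_s(d, length): most spectrum segments within distance d of one center.

    Segments are counted by starting position, so repeated segments count
    once per occurrence in the circular sequence s.
    """
    G = len(s)
    segments = [''.join(s[(i + j) % G] for j in range(length)) for i in range(G)]
    centers = map(''.join, product(alphabet, repeat=length))
    return max(sum(hamming(seg, x) <= d for seg in segments) for x in centers)


print(radius(["aaa", "acc"]))             # -> 1 (center "aac" is at distance 1 from both)
print(max_approx_copies("acgta", 0, 2))   # -> 1 (all 2-mers are distinct)
print(max_approx_copies("acgta", 1, 2))   # -> 3
```

Computing the radius exactly is a closest-string problem, so this enumeration is only practical for very short segments.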

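Notice that $M_s(d, \ell)$ is monotonically decreasing in $\ell$. We let 

$$\tilde{\ell}_{\mathrm{crit}}(s, D) = \min_{k \geq \ell_{\mathrm{crit}}(s)} \; k + D \cdot M_s(D, k+1). \tag{3}$$

Notice that $\tilde{\ell}_{\mathrm{crit}}(s, D) \geq \tilde{\ell}_{\mathrm{crit}}(s, 0) = \ell_{\mathrm{crit}}(s)$. Our main result is 
the following. 

Theorem 2. If $L > \tilde{\ell}_{\mathrm{crit}}(s, D)$, then s is the unique sequence 
that is consistent with $\mathcal{R}_{L,D}(s)$. 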

The main tool used to prove Theorem 2 is a result about 
spectrum error correction. More precisely, we show that from 
a noisy version of the L-spectrum of s, $\mathcal{R}_{L,D}(s)$, it is possible 
to obtain $\mathcal{R}_{L',0}(s)$, for some effective read length L' < L. This 
result and the proof of Theorem 2 are presented in Section V. 
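
Definition (3) can be evaluated directly on toy inputs once the noiseless critical length is known. The sketch below takes that value as an input named `l_crit` (an assumption: it is supplied by the caller, not computed here) and evaluates the approximate-repeat count by brute force. Because that count is at least 1, the objective at k is at least k + D, which gives a simple stopping rule. All function names are hypothetical.

```python
from itertools import product


def hamming(x: str, y: str) -> int:
    return sum(a != b for a, b in zip(x, y))


def max_approx_copies(s: str, d: int, length: int, alphabet: str = "acgt") -> int:
    """Brute-force M_s(d, length): segments of circular s within distance d of one center."""
    G = len(s)
    segments = [''.join(s[(i + j) % G] for j in range(length)) for i in range(G)]
    centers = map(''.join, product(alphabet, repeat=length))
    return max(sum(hamming(seg, x) <= d for seg in segments) for x in centers)


def noisy_critical_length(s: str, D: int, l_crit: int, alphabet: str = "acgt") -> int:
    """Evaluate min over k >= l_crit of k + D * M_s(D, k + 1), as in (3).

    l_crit is the noiseless critical length of s, supplied by the caller.
    """
    best = None
    k = l_crit
    # M_s >= 1, so the objective at k is at least k + D; once k + D >= best,
    # no larger k can improve on the current minimum.
    while best is None or k + D < best:
        value = k + D * max_approx_copies(s, D, k + 1, alphabet)
        best = value if best is None else min(best, value)
        k += 1
    return best


print(noisy_critical_length("acgt", D=0, l_crit=1))  # -> 1
print(noisy_critical_length("acgt", D=1, l_crit=1))  # -> 3
```

With D = 0 the function returns `l_crit`, matching the identity stated after (3). For real genomes this enumeration is infeasible, which is consistent with the use of alignment heuristics for Table I.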

As in the noiseless case, we point out that $\tilde{\ell}_{\mathrm{crit}}(s, D)$ cannot 
be computed a priori, since it is a function of the ground truth 
sequence s. However, Theorem 2 can in fact be used to obtain a 
certificate result analogous to Corollary 1, allowing one to certify 
whether an assembly $\hat{s}$ is correct, even without prior knowledge 
of $\tilde{\ell}_{\mathrm{crit}}(s, D)$ and $M_s(D, \cdot)$. 

Corollary 2. If a sequence $\hat{s}$ is consistent with $\mathcal{R}_{L,D}(s)$ and 
$L > \tilde{\ell}_{\mathrm{crit}}(\hat{s}, D)$, then $\hat{s} = s$. 

Proof: If $\hat{s}$ is consistent with $\mathcal{R}_{L,D}(s)$, by the definition of 
consistency, $\mathcal{R}_{L,D}(s)$ can be viewed as a set of reads $\mathcal{R}_{L,D}(\hat{s})$ 
from $\hat{s}$, with an erasure pattern satisfying (a) and (b). But from 
Theorem 2, if $L > \tilde{\ell}_{\mathrm{crit}}(\hat{s}, D)$, $\hat{s}$ is the unique sequence that is 
consistent with $\mathcal{R}_{L,D}(\hat{s}) = \mathcal{R}_{L,D}(s)$. Since s must also be 
consistent with $\mathcal{R}_{L,D}(s)$, we must have $\hat{s} = s$. ■ 

In Table I, we show the value of $\tilde{\ell}_{\mathrm{crit}}(s, D)$ computed for 
several real genomes. Computing $\tilde{\ell}_{\mathrm{crit}}(s, D)$ is generally impractical 
from a computational standpoint, so the values in Table I 
are based on heuristics implemented by a sequence alignment 
tool called Nucmer [10]. We choose the value of D such that 
$D/\ell_{\mathrm{crit}}(s) \approx 15\%$. We point out that the first two genomes, R. 
sphaeroides and S. aureus, are from the GAGE dataset [6], which 
is used as a benchmark for assemblers. Notice that, with the 
exception of E. coli 536, in all cases $\tilde{\ell}_{\mathrm{crit}}(s, D) = \ell_{\mathrm{crit}}(s) + mD$ 
for $m \in \{2, 3, 4\}$. This occurs because, for the genomes considered, 
$\ell_{\mathrm{crit}}(s)$ is already long enough so that there aren’t many 
approximate repeats of that length. 


Genome (s)         $\ell_{\mathrm{crit}}(s)$    $\tilde{\ell}_{\mathrm{crit}}(s, D)$    D
R. sphaeroides     111         331          30
S. aureus          1799        2399         200
A. ferrooxidans    2628        3228         300
E. coli 536        3245        4462         450
E. coli K-12       1744        2544         200

TABLE I 

Computed $\tilde{\ell}_{\mathrm{crit}}(s, D)$ for $D/\ell_{\mathrm{crit}}(s) \approx 15\%$ 


While the results in this section were presented for an erasure 
model, they can be extended to a substitution error model. In 
fact, if instead of D erasures per read and per base, we have 
D/2 substitution errors, the proofs of Theorems 2 and 3 can 
be modified accordingly, and the statements still hold. We will 
restrict the discussion to the erasure case for simplicity. 

V. Spectrum Error Correction 

The main result we use to prove Theorem 2 is a statement 
about when it is possible to take a noisy L-spectrum of s and 
unambiguously construct its noiseless L'-spectrum, for L' < L. 

Theorem 3. Suppose that, for some k, we have 

$$L > k + D \cdot M_s(D, k+1). \tag{4}$$

Then, for any sequence $\hat{s}$ that is consistent with $\mathcal{R}_{L,D}(s)$, 
$\mathcal{R}_{k+1,0}(\hat{s}) = \mathcal{R}_{k+1,0}(s)$. 

Theorem 3 says that, by finding a consistent assembly of 
$\mathcal{R}_{L,D}(s)$, we can obtain the (noiseless) (k+1)-spectrum of s, 
as long as k satisfies (4). Therefore, when $L > \tilde{\ell}_{\mathrm{crit}}(s, D)$, if 
we let $k^*$ be the minimizer in (3), we have that 
$L > k^* + D \cdot M_s(D, k^*+1)$ and, by Theorem 3, any $\hat{s}$ that is consistent with 
$\mathcal{R}_{L,D}(s)$ has the same $(k^*+1)$-spectrum $\mathcal{R}_{k^*+1,0}(s)$. But since 
$k^* + 1 > \ell_{\mathrm{crit}}(s)$, Theorem 1 implies that there is only one 
sequence that is consistent with $\mathcal{R}_{k^*+1,0}(s)$, and we must have 
$\hat{s} = s$. This proves Theorem 2. 

Next, we turn to the proof of Theorem 3. Suppose that we 
pick some k satisfying (4) and that $\sigma$ is a consistent assembly 
for the set of reads $\mathcal{R}_{L,D}(s)$ with assembled sequence $\hat{s} = \hat{s}(\sigma)$. 

The main idea of the proof is to show that (k+1)-blocks in s 
and $\hat{s}$ are in one-to-one correspondence; i.e., $s_i^{k+1} = \hat{s}_{\tau(i)}^{k+1}$ for 
a bijective mapping $\tau : \{1, ..., G\} \to \{1, ..., G\}$, which implies 
$\mathcal{R}_{k+1,0}(\hat{s}) = \mathcal{R}_{k+1,0}(s)$. 

In order to show the existence of this bijection $\tau$, we consider 
a bipartite graph $(V_s, V_{\hat{s}}, E)$, where $V_s = V_{\hat{s}} = \{1, ..., G\}$ and 
$E = \{(u, v) \in V_s \times V_{\hat{s}} : s_u^{k+1} = \hat{s}_v^{k+1}\}$, as illustrated in Fig. 4. 

Fig. 4. We place an edge (u, v) in $(V_s, V_{\hat{s}}, E)$ if $s_u^{k+1} = \hat{s}_v^{k+1}$. 

The existence of the bijective mapping $\tau$ is equivalent to the 
existence of a perfect matching in $(V_s, V_{\hat{s}}, E)$. Hence, Theorem 3 
is equivalent to the following: 

Claim 1. There exists a perfect matching in $(V_s, V_{\hat{s}}, E)$. 

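For a set of nodes $U \subset V_{\hat{s}}$, we let $\delta(U) = \{v \in V_s : (v, u) \in 
E \text{ for some } u \in U\}$ be the set of neighbors of U. We will show that, 
for any $U \subset V_{\hat{s}}$, $|\delta(U)| \geq |U|$, and by Hall’s marriage theorem, 
Claim 1 will follow. We will first state the following lemma, 
which establishes $|\delta(U)| \geq |U|$ for the special case of sets U of 
the form $U_x = \{u \in V_{\hat{s}} : \hat{s}_u^{k+1} = x\}$ for some $x \in \Sigma^{k+1}$. 

Lemma 1. For the bipartite graph $(V_s, V_{\hat{s}}, E)$, $|\delta(U_x)| \geq |U_x|$ 
for any $x \in \Sigma^{k+1}$. 

The proof of Lemma 1 is at the end of this section. Now 
consider a general set $U \subset V_{\hat{s}}$. Let $X = \{\hat{s}_u^{k+1} \in \Sigma^{k+1} : u \in U\}$. 
Since two nodes $u, u' \in U$ with $\hat{s}_u^{k+1} \neq \hat{s}_{u'}^{k+1}$ cannot be 
connected to the same node $v \in V_s$, and since all nodes in $U_x$ 
have the same neighbors (so that $\delta(U_x \cap U) = \delta(U_x)$ for $x \in X$), we have 

$$|\delta(U)| = \Big|\delta\Big(\bigcup_{x \in X} U_x \cap U\Big)\Big| = \sum_{x \in X} |\delta(U_x)| \geq \sum_{x \in X} |U_x| \geq \sum_{x \in X} |U_x \cap U| = |U|,$$

where the first inequality follows from Lemma 1. By applying 
Hall’s theorem, Claim 1 follows, implying that $\mathcal{R}_{k+1,0}(\hat{s}) = 
\mathcal{R}_{k+1,0}(s)$. Therefore, to conclude the proof of Theorem 3, we 
just need to prove Lemma 1. 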
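
The matching argument can be illustrated on toy sequences. In the graph above, every block value x forms a complete bipartite component between the positions of x in s and in the assembled sequence, so Hall's condition only needs to be checked on the sets of positions sharing a block value. The sketch below does this with block counts; the function names are assumptions of this example, and `s_hat` stands for the assembled sequence.

```python
from collections import Counter


def blocks(seq: str, k: int) -> list:
    """The (k+1)-blocks of the circular sequence seq, one per start position."""
    G = len(seq)
    return [''.join(seq[(i + j) % G] for j in range(k + 1)) for i in range(G)]


def has_perfect_matching(s: str, s_hat: str, k: int) -> bool:
    """Check Hall's condition for the bipartite graph of equal (k+1)-blocks.

    For each block value x, the neighbors of the positions of x in s_hat are
    the positions of x in s, so the condition is count_s[x] >= count_hat[x].
    """
    if len(s) != len(s_hat):
        return False
    count_s = Counter(blocks(s, k))
    count_hat = Counter(blocks(s_hat, k))
    return all(count_s[x] >= count_hat[x] for x in count_hat)


print(has_perfect_matching("aabb", "bbaa", k=1))  # -> True  (cyclic shift)
print(has_perfect_matching("aabb", "abab", k=1))  # -> False (block "ab" appears twice)
print(has_perfect_matching("aabb", "abab", k=0))  # -> True  (same base counts)
```

Since both sides have G nodes, the condition holds exactly when the two multisets of (k+1)-blocks coincide, which is the spectrum equality stated in Theorem 3.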

Proof of Lemma 1: Let $U_x = \{t_1, ..., t_q\} \subset V_{\hat{s}}$, where 
$t_1, ..., t_q$ are distinct and $\hat{s}_{t_1}^{k+1} = ... = \hat{s}_{t_q}^{k+1} = x$. Consider one 
such block $\hat{s}_t^{k+1}$ for $t \in \{t_1, ..., t_q\}$. There are L − k reads 
that cover $\hat{s}_t^{k+1}$ in $\hat{s}$, as illustrated in Fig. 5. These are the 
reads given by $r_{\sigma^{-1}(t-n)}$, for n = 0, ..., L − k − 1. Notice 
that read $r_{\sigma^{-1}(t-n)}$ was originally obtained from the segment 
$s_{\sigma^{-1}(t-n)}^{L}$ from the true sequence s. The consistency requirement 
on $\sigma$ thus implies that $d_H(s_{\sigma^{-1}(t-n)}^{L}, \hat{s}_{t-n}^{L}) \leq D$. Moreover, if 
we just focus on the (k+1)-block corresponding to $\hat{s}_t^{k+1}$, we 
have $d_H(s_{\sigma^{-1}(t-n)+n}^{k+1}, \hat{s}_t^{k+1}) \leq D$, which 
holds for each $t \in \{t_1, ..., t_q\}$ and n = 0, ..., L − k − 1. 

Fig. 5. For an arbitrary length-(k+1) block B in $\hat{s}$, the L − k reads that 
completely cover B according to the assembly $\sigma$ are shaded (in this example, 
L = 6 and k = 2). By mapping these L − k reads back to s, we find the 
corresponding (k+1)-blocks in s given by $s_{\sigma^{-1}(t-j)+j}^{k+1}$ for j = 0, 1, 2, 3. 
Notice that, in this example, two of these blocks coincide, because the 
corresponding reads are aligned to each other in the same way in s and $\hat{s}$. 

If we now consider the set of all such (k+1)-blocks in s, 

$$S = \{ s_{\sigma^{-1}(t_i - n) + n}^{k+1} : n = 0, ..., L-k-1,\ i = 1, ..., q \}, \tag{5}$$

since $d_H(y, x) \leq D$ for each $y \in S$, we have that $\rho(S) \leq D$. 
Hence, if we let 

$$T = \{ \sigma^{-1}(t_i - n) + n : 0 \leq n \leq L-k-1,\ 1 \leq i \leq q \}$$

be the starting positions of these blocks in s, T must satisfy 
$|T| \leq M_s(D, k+1)$. Now consider the set of (n, i) pairs 

$$B = \{ (n, i) : 0 \leq n \leq L-k-1,\ 1 \leq i \leq q \}.$$

We will define a partition on B according to the value of 
$\sigma^{-1}(t_i - n) + n$. More precisely, we will let 

$$B_\tau = \{ (n, i) \in B : \sigma^{-1}(t_i - n) + n = \tau \},$$

for $\tau \in T$. It is clear that $\{B_\tau\}_{\tau \in T}$ is a partition of B. We claim 
that there exist distinct $\tau_1, ..., \tau_q \in T$ such that $|B_{\tau_j}| \geq D + 1$, 
for j = 1, ..., q. Suppose by contradiction that this is not the 
case, and we have at most q − 1 parts $B_\tau$ with $|B_\tau| \geq D + 1$. 
Notice that, since $\sigma : \{1, ..., G\} \to \{1, ..., G\}$ is one-to-one, 
$\sigma^{-1}(t_i - n) + n \neq \sigma^{-1}(t_j - n) + n$ if $t_i \neq t_j$, and, for any 
$\tau \in T$, we must have $|B_\tau| \leq L - k$. Therefore, since (4) implies 
$L - k - 1 \geq D \cdot M_s(D, k+1)$, 

$$\sum_{\tau \in T} |B_\tau| \leq (q-1)(L-k) + (|T| - q + 1) D \leq (q-1)(L-k) + D \cdot M_s(D, k+1) \leq q(L-k) - 1.$$

But since $\sum_{\tau \in T} |B_\tau| = |B| = q(L-k)$, we have a contradiction. 

Now consider the segments $s_{\tau_j}^{k+1}$ with $|B_{\tau_j}| \geq D + 1$, for 
j = 1, ..., q. Since $\tau_1, ..., \tau_q$ are all distinct, these segments start 
at different points in s. Moreover, since $|B_{\tau_j}| \geq D + 1$, each 
$s_{\tau_j}^{k+1}$ is covered by D + 1 reads from the reads that cover $\hat{s}_{t_i}^{k+1}$, 
i = 1, ..., q. Notice that these must be distinct reads from the 
multiset $\mathcal{R}_{L,D}(s)$. This is because two distinct pairs (n, i) and 
(m, j) in $B_\tau$ must have $n \neq m$, and the corresponding reads are 
$r_{\sigma^{-1}(t_i - n)} = r_{\tau - n}$ and $r_{\sigma^{-1}(t_j - m)} = r_{\tau - m}$, which are distinct 
reads (not necessarily different sequences from $\Sigma'^L$). Finally, as 
illustrated in Fig. 6, we note that, since there are at most D 
erasures per base in s, we have that $s_{\tau_j}^{k+1} = x$, for j = 1, ..., q. 
We conclude that $|\delta(U_x)| \geq q$. ■ 

Fig. 6. If $|B_{\tau_j}| \geq D + 1$, at least D + 1 of the reads that cover one of 
$\hat{s}_{t_1}^{k+1}, ..., \hat{s}_{t_q}^{k+1}$ in $\hat{s}$ also cover $s_{\tau_j}^{k+1}$ in s (in this example, D = 3). Since there 
are at most D erasures per base in s, we must have $s_{\tau_j}^{k+1} = x$. 

VI. Concluding Remarks 

Our results show that for several actual genomes, if we are in 
a dense-read model with reads 20-40% longer than the noiseless 
requirement $\ell_{\mathrm{crit}}(s)$, perfect assembly feasibility is robust to 
erasures at a rate of about 10%. While this is not as optimistic 
as the message from [7], we emphasize that we consider an 
adversarial error model. When errors instead occur at random 
locations, it is natural to expect less stringent requirements. 

Another message provided by our results deals with error 
correction. Most current sequencing technologies employ error 
correction algorithms based on aligning reads to form clusters 
and outputting a cleaned-up read for each cluster. However, the 
spectrum error correction result from Theorem 3 suggests that 
a “global” approach to generating cleaned-up reads (based on 
finding a consistent assembly and looking at its spectrum) may 
perform better than cluster-based, or local, error correction. 

A direction for future work is to replace the dense-read model 
with a shotgun read model. While the L-spectrum approach 
is motivated by the high throughput of current technologies, it 
bypasses the question of the actual coverage depth required for 
assembly. As was the case in [3], we expect the read length 
requirements from the dense-read model to translate into bridging 
conditions in the shotgun model, allowing one to compute the 
coverage required for perfect reconstruction with high probability. 